2 research outputs found
Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia
Neural machine translation (NMT) for low-resource local languages in
Indonesia faces significant challenges, including the need for a representative
benchmark and limited data availability. This work addresses these challenges
by comprehensively analyzing training NMT systems for four low-resource local
languages in Indonesia: Javanese, Sundanese, Minangkabau, and Balinese. Our
study encompasses various training approaches, paradigms, data sizes, and a
preliminary study into using large language models for synthetic low-resource
languages parallel data generation. We reveal specific trends and insights into
practical strategies for low-resource language translation. Our research
demonstrates that despite limited computational resources and textual data,
several of our NMT systems achieve competitive performances, rivaling the
translation quality of zero-shot gpt-3.5-turbo. These findings significantly
advance NMT for low-resource languages, offering valuable guidance for
researchers in similar contexts.Comment: Accepted on SEALP 2023, Workshop in IJCNLP-AACL 202
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
We present NusaCrowd, a collaborative initiative to collect and unify
existing resources for Indonesian languages, including opening access to
previously non-public resources. Through this initiative, we have brought
together 137 datasets and 118 standardized data loaders. The quality of the
datasets has been assessed manually and automatically, and their value is
demonstrated through multiple experiments. NusaCrowd's data collection enables
the creation of the first zero-shot benchmarks for natural language
understanding and generation in Indonesian and the local languages of
Indonesia. Furthermore, NusaCrowd brings the creation of the first multilingual
automatic speech recognition benchmark in Indonesian and the local languages of
Indonesia. Our work strives to advance natural language processing (NLP)
research for languages that are under-represented despite being widely spoken